Text Block Recognition in Multi-Oriented Handwritten Documents
Abstract
Automatic detection of text blocks is an important step before applying OCR or word-spotting techniques to document images. Our approach focuses on handwritten (historical) documents and uses the Gabor transform to facilitate this task. Apart from the main text, which often consists of rectangular text blocks, marginalia are of special interest here. These areas are generally unconstrained in size, dimensions, and orientation. Our system detects text blocks of at least three lines, each forming a moderately homogeneous region with respect to the orientation and spacing of its text lines. Experiments on 40 documents, written in different European and Asian writing systems, show good results, with performance depending on the complexity of the layout.
Keywords: document layout analysis; manuscript; text block recognition; Gabor transform
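To make the role of the Gabor transform more concrete, the following is a minimal sketch of how a small bank of oriented Gabor filters can estimate the dominant text-line orientation of an image region. It is not the authors' implementation: the function names (`gabor_kernel`, `response_energy`, `dominant_line_orientation`, `synthetic_lines`), the parameter choices (wavelength equal to the line spacing, `sigma` equal to half of it, eight orientations), and the striped toy image are all assumptions made for illustration. The sketch is dependency-free plain Python; a practical system would use an image-processing library.

```python
import math


def gabor_kernel(theta, wavelength, sigma, half_size):
    """Real (cosine) Gabor kernel whose carrier wave travels along angle theta."""
    kernel = []
    for y in range(-half_size, half_size + 1):
        row = []
        for x in range(-half_size, half_size + 1):
            # Coordinates rotated so that xr points along the carrier direction.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    # Subtract the mean so that flat (textless) regions give zero response.
    mean = sum(sum(row) for row in kernel) / (len(kernel) ** 2)
    return [[value - mean for value in row] for row in kernel]


def response_energy(image, kernel):
    """Sum of squared filter responses over all positions where the kernel fits."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    energy = 0.0
    for top in range(height - k + 1):
        for left in range(width - k + 1):
            acc = 0.0
            for dy in range(k):
                image_row = image[top + dy]
                kernel_row = kernel[dy]
                for dx in range(k):
                    acc += image_row[left + dx] * kernel_row[dx]
            energy += acc * acc
    return energy


def dominant_line_orientation(image, line_spacing, n_orientations=8):
    """Estimate the text-line orientation in degrees, in the range [0, 180)."""
    best_wave_deg, best_energy = 0.0, -1.0
    for i in range(n_orientations):
        theta = math.pi * i / n_orientations
        kernel = gabor_kernel(
            theta,
            wavelength=line_spacing,   # assumption: line spacing is known
            sigma=line_spacing / 2.0,  # assumption: illustrative bandwidth
            half_size=line_spacing,
        )
        energy = response_energy(image, kernel)
        if energy > best_energy:
            best_wave_deg, best_energy = 180.0 * i / n_orientations, energy
    # The strongest carrier runs across the text lines, so the lines
    # themselves are perpendicular to it.
    return (best_wave_deg + 90.0) % 180.0


def synthetic_lines(size, spacing, vertical=False):
    """Toy 'text block': stripes repeating every `spacing` pixels."""
    return [
        [1.0 if ((x if vertical else y) % spacing) < spacing // 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]
```

Calling `dominant_line_orientation(synthetic_lines(24, 6), line_spacing=6)` on the horizontally striped toy image returns `0.0`, and the same call with `vertical=True` returns `90.0`. Grouping neighbouring regions that agree in estimated orientation and line spacing is one plausible way to arrive at the "moderately homogeneous" text blocks the abstract describes, but the abstract itself does not specify that step.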
Similar resources
Connected Component Based Word Spotting on Persian Handwritten image documents
Word spotting makes unindexed image documents searchable by locating a given query word or words in a document image. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...
Document seal detection using GHT and character proximity graphs
This paper deals with the automatic detection of seals (stamps) in documents with cluttered backgrounds. Seal detection is challenging because a seal is multi-oriented, has an arbitrary shape, may overlap in part with a signature, and is affected by noise, etc. Here, a seal object is characterized by scale and rotation invariant spatial feature descriptors computed from recognition result of individual conn...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script
India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script-specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to identify the scripts of handwritten text correctly. In this paper, we present a system that automatically separates the scripts of handwritten words from a docum...
Distinction between handwritten and machine-printed text based on the bag of visual words model
In a variety of documents, ranging from forms to archive documents and books with annotations, machine-printed and handwritten text may coexist in the same document image, raising significant issues within the recognition pipeline. It is, therefore, necessary to separate the two types of text so that it becomes feasible to apply different recognition methodologies to each modality. In this pape...